OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A search engine for Arabic documents

Internal identifier: 000B04 (Main/Exploration); previous: 000B03; next: 000B05

Authors: T. Sari [Algeria]; A. Kefali [Algeria]

Source:

RBID: Hal:hal-00334402

Abstract

This paper presents an approach to indexing and searching degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CC) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. The test was performed on some Arabic historical documents and showed good performance.
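
The approximate matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-letter feature alphabet (for example `L` for a loop, `A` for an ascender, `D` for a descender), the function names `levenshtein` and `search`, and the `max_distance` threshold are all assumptions made for illustration. The sketch only shows how a Levenshtein distance lets a query code match stored codes that contain a missing or spurious feature.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute, each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        previous = current
    return previous[-1]


def search(query: str, codes: list[str], max_distance: int = 1) -> list[tuple[int, str, int]]:
    """Return (position, code, distance) for stored codes within max_distance of query.

    `codes` stands for the feature-code strings of one document (hypothetical
    layout); results are ordered by increasing distance.
    """
    hits = []
    for position, code in enumerate(codes):
        distance = levenshtein(query, code)
        if distance <= max_distance:
            hits.append((position, code, distance))
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    # Hypothetical codes: each string describes one connected component or word.
    document_codes = ["LAD", "LA", "DDL", "LADL"]
    for hit in search("LAD", document_codes, max_distance=1):
        print(hit)
```

Raising `max_distance` makes the search more tolerant of unreliable feature extraction at the price of more false matches; a suitable value would have to be tuned on real data.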

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A search engine for Arabic documents</title>
<author>
<name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00334402</idno>
<idno type="halId">hal-00334402</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<date when="2008-10">2008-10</date>
<idno type="wicri:Area/Hal/Corpus">000014</idno>
<idno type="wicri:Area/Hal/Curation">000014</idno>
<idno type="wicri:Area/Hal/Checkpoint">000120</idno>
<idno type="wicri:Area/Main/Merge">000B15</idno>
<idno type="wicri:Area/Main/Curation">000B04</idno>
<idno type="wicri:Area/Main/Exploration">000B04</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">A search engine for Arabic documents</title>
<author>
<name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author>
<name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID">
<orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc>
<address>
<country key="DZ"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300650" type="direct">
<org type="institution" xml:id="struct-300650" status="VALID">
<orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc>
<address>
<addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Arabic handwriting recognition</term>
<term>Document retrieval</term>
<term>handwriting segmentation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper presents an approach to indexing and searching degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CC) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. The test was performed on some Arabic historical documents and showed good performance.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Algérie</li>
</country>
</list>
<tree>
<country name="Algérie">
<noRegion>
<name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
</noRegion>
<name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B04 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000B04 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-00334402
   |texte=   A search engine for Arabic documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024